Review fixes for #715: drop scope creep, dead code, redundant test; dedup row-major LDE#741
Merged
Merged
Conversation
The FxHasher/FxHashMap op-dedup micro-optimization is unrelated to the row-major GPU LDE rework and was only applied to 4 of 6 dedup tables. Revert the table maps to std HashMap and drop the hasher; it can land as its own focused PR.
The GPU full path is covered by the normal prove/verify suite built with --features cuda (plus gpu_path_fires_end_to_end), the CPU path by the non-cuda suite, and GPU/CPU equivalence by the merkle/barycentric parity tests. Its force-CPU leg also never ran on CPU: gpu_lde_threshold() only re-read the env var under cfg(test), but from the prover integration crate stark compiles without cfg(test), so the OnceLock cached the first value. Simplify gpu_lde_threshold() to a single cached impl now that the per-call re-read has no consumer.
keccak.cu: move keccak256_leaves_base_row_major out of keccak_merkle_level's doc block so the child-pair->parent doc rejoins its kernel. prover.rs: delete columns_to_row_major, which has no callers after the row-major GPU path stopped materializing GPU-expanded columns.
Extract coset_lde_row_major_inner shared by the base and ext3 _keep entry points (they differed only by m vs m*3 and the handle type), removing ~110 lines of drift-prone duplication. Add debug_assert!(num_rows >= 2) to launch_keccak_base_row_major: the kernel shifts by (64 - log_num_rows), UB at num_rows==1, matching the guard in launch_keccak_base.
The assert checked gpu_parts_lde_calls() > 0 with a comment claiming branch/shift tables are degree-3 — both false: fib_iterative_1M tables all have number_of_parts <= 2, and the common degree-2 case fires the fused two-halves path (gpu_extend_halves_calls), counted separately from the parts>2 path (gpu_parts_lde_calls) since #700. Assert on the sum so either composition-LDE path satisfies it. Validated on RTX 5090 / CUDA 13.1: make test-math-cuda 78/78, make test-cuda-integration green, proof verifies.
jotabulacios
approved these changes
Jun 29, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Review fixes for #715 (row-major GPU LDE). Targets the PR branch so they fold into #715. Net cleanup, −~150 lines.
Changes
FxHashMapop-dedup change (scope creep). TheFxHasher/FxHashMapmicro-optimization intypes.rs+branch/dvrm/lt/mul.rsis unrelated to the LDE rework and was only applied to 4 of 6 dedup tables (eq.rs/bytewise.rsleft on stdHashMap). Reverted tostd::collections::HashMap; it can land as its own focused PR.gpu_and_cpu_proofs_both_verifytest. The GPU full path is covered by the normal prove/verify suite built with--features cuda(plusgpu_path_fires_end_to_end), the CPU path by the non-cuda suite, and GPU/CPU equivalence by themerkle_root_parity/barycentric_cpu_gpu_paritytests. Its "force CPU" leg also never ran on CPU:gpu_lde_threshold()only re-read the env var under#[cfg(test)], but from theproverintegration-test cratestarkcompiles withoutcfg(test), so theOnceLockcached the first value and the secondset_varwas a no-op. Also simplifiedgpu_lde_threshold()to a single cached impl now that the per-call re-read has no consumer.keccak256_leaves_base_row_majorout ofkeccak_merkle_level's doc block (it had split the "child pair → one parent" comment away from its kernel). Deletedcolumns_to_row_majorinprover.rs— zero callers after the row-major path stopped materializing GPU-expanded columns.debug_assert!(num_rows >= 2)tolaunch_keccak_base_row_major; the kernel computes__brevll(tid) >> (64 - log_num_rows), which is UB atnum_rows == 1. Matches the existing guard inlaunch_keccak_base.coset_lde_row_major_inner, shared by the base and ext3_keepentry points (they differed only bymvsm * 3and the handle type), removing ~110 lines of drift-prone duplication.gpu_path_fires_end_to_end. It assertedgpu_parts_lde_calls() > 0with a comment claiming branch/shift tables are degree-3 — both false:fib_iterative_1Mtables all havenumber_of_parts <= 2, and the common degree-2 case fires the fused two-halves path (gpu_extend_halves_calls), counted separately from theparts > 2path since perf(stark): fuse composition half-extension onto coset_lde_full (precomputed twiddles) #700. Now asserts on the sum so either composition-LDE path satisfies it.Verification
GPU-validated on RTX 5090 / CUDA 13.1:
make test-math-cuda→ 78/78 passed, includingmerkle_root_parity(base + ext3) andbarycentric_cpu_gpu_parity— confirms thecoset_lde_row_major_innerconsolidation is behavior-preserving.make test-cuda-integration→ green (gpu_path_fires_end_to_end: GPU dispatches fire, proof verifies).make bench-prover-cuda→prove(fib_iterative_1M)= 5.89s, GPU LDE path fired.Host checks:
cargo check+cargo clippy -p lambda-vm-proverclean; all edited Rust filesrustfmt-clean; thekeccak.cuchange is a pure reorder (added lines == removed lines, no logic change).